In this HTML document, we explored the Gapminder dataset and generated the following analyses and report:
suppressPackageStartupMessages({
library(tidyverse)
library(plotly)
library(multcomp)
library(DT)
})
We first read in the data
gapminder_clean <- read_csv("gapminder_clean.csv", show_col_types = FALSE)
These are the column names to the data table:
colnames(dplyr::select(gapminder_clean, -"...1"))
[1] "Country Name"
[2] "Year"
[3] "Agriculture, value added (% of GDP)"
[4] "CO2 emissions (metric tons per capita)"
[5] "Domestic credit provided by financial sector (% of GDP)"
[6] "Electric power consumption (kWh per capita)"
[7] "Energy use (kg of oil equivalent per capita)"
[8] "Exports of goods and services (% of GDP)"
[9] "Fertility rate, total (births per woman)"
[10] "GDP growth (annual %)"
[11] "Imports of goods and services (% of GDP)"
[12] "Industry, value added (% of GDP)"
[13] "Inflation, GDP deflator (annual %)"
[14] "Life expectancy at birth, total (years)"
[15] "Population density (people per sq. km of land area)"
[16] "Services, etc., value added (% of GDP)"
[17] "pop"
[18] "continent"
[19] "gdpPercap"
Instructions: Filter the data to include only rows where
Yearis1962and then make a scatter plot comparingCO2 emissions (metric tons per capita)andgdpPercapfor the filtered data.
We started by filtering the dataset to Year == 1962:
gapminder_clean.filtered <- gapminder_clean %>%
filter(Year == 1962)
The following is a scatter plot of CO2 emissions-gdpPercap in year 1962, with x = CO2 emissions, and y = gdpPercap. By observation, the data fits a linear model, so we also plotted it on the diagram:
plot <- gapminder_clean.filtered %>%
ggplot(aes(x = `CO2 emissions (metric tons per capita)`,
y = gdpPercap)) +
geom_point(na.rm = TRUE) +
scale_x_log10() +
scale_y_log10() +
geom_smooth(method = "lm", na.rm = TRUE)
ggplotly(plot, height = 350, width = 600)
`geom_smooth()` using formula = 'y ~ x'
Instructions: On the filtered data, calculate the correlation of
CO2 emissions (metric tons per capita)andgdpPercap. What is the correlation and associated p value?
To determine the Pearson correlation coefficient, we should a priori ensure the data is normally distributed. This is implemented by performing the Shapiro-Wilk normality test on our x’s and y’s. (A p-value < 0.05 indicates that our data fit a normal distribution).
shapiro.test(gapminder_clean.filtered$`CO2 emissions (metric tons per capita)`)
Shapiro-Wilk normality test
data: gapminder_clean.filtered$`CO2 emissions (metric tons per capita)`
W = 0.45714, p-value < 2.2e-16
shapiro.test(gapminder_clean.filtered$gdpPercap)
Shapiro-Wilk normality test
data: gapminder_clean.filtered$gdpPercap
W = 0.38638, p-value < 2.2e-16
Both x’s and y’s fit a normal distribution. Thus, a Pearson correlation coefficient could be calculated. Otherwise, a rank-based correlation test such as the Spearman correlation test should be used.
cor <- cor.test(
x = gapminder_clean.filtered$`CO2 emissions (metric tons per capita)`,
y = gapminder_clean.filtered$gdpPercap,
method="pearson",
use = "complete.obs")
cor
Pearson's product-moment correlation
data: gapminder_clean.filtered$`CO2 emissions (metric tons per capita)` and gapminder_clean.filtered$gdpPercap
t = 25.269, df = 106, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.8934697 0.9489792
sample estimates:
cor
0.9260817
Findings: The
CO2 emissions (metric tons per capita)andgdpPercapare positively correlated. The Pearson correlation coefficient is 0.9260817, which means that as GDP per capital rises, so does CO2 emission, and vice versa. The p-value is 1.1286792^{-46}, which negates the null hypothesis that the relationship is a result of pure chance.
Instructions: On the unfiltered data, answer “In what year is the correlation between
'CO2 emissions (metric tons per capita)'andgdpPercapthe strongest?” Filter the dataset to that year for the next step...
We grouped the dataset by year, calculating the Pearson
correlation coefficient between CO2 emissions and
gdpPercap of each. The results were visualized by a line
plot.
gapminder_clean.cor_by_year <- gapminder_clean %>%
group_by(Year) %>%
summarize(cor = cor(`CO2 emissions (metric tons per capita)`, gdpPercap, use = "complete.obs")) %>%
arrange(desc(cor))
plot <- gapminder_clean.cor_by_year %>%
ggplot(aes(x = Year, y = cor)) +
geom_line() +
geom_point()
ggplotly(plot, height = 350, width = 600)
Findings: Year 1967 witnessed the strongest correlation of
CO2 emissionsandgdpPercap
We then filtered the dataset accordingly:
gapminder_clean.filtered <- gapminder_clean %>%
filter(Year== gapminder_clean.cor_by_year[[1,1]])
Instructions: Using
plotly, create an interactive scatter plot comparing'CO2 emissions (metric tons per capita)'andgdpPercap, where the point size is determined bypop(population) and the color is determined by thecontinent. You can easily convert anyggplotplot to aplotlyplot using theggplotly()command.
The following are the desired plots:
plot <- gapminder_clean.filtered %>%
ggplot(aes(x = `CO2 emissions (metric tons per capita)`, y = gdpPercap)) +
geom_point(aes(color = continent, size = pop, text = paste0("country: ", `Country Name`)), na.rm = TRUE) +
scale_x_log10() +
scale_y_log10()
Warning: Ignoring unknown aesthetics: text
plot
ggplotly(plot, height = 350, width = 600)
Instructions: What is the relationship between
continentand'Energy use (kg of oil equivalent per capita)'? (stats test needed)
We visualized the data by plotting continent against
Energy use on a boxplot:
plot <- gapminder_clean %>%
subset(!is.na(continent)) %>%
ggplot(aes(x = continent, y = `Energy use (kg of oil equivalent per capita)`, fill = continent)) +
geom_boxplot(na.rm = TRUE)
ggplotly(plot, height = 350, width = 600)
Since we are manipulating over quantitative data
(Energy use) among multiple groups
(continent), it is worth exploring whether the outcome
variable (Energy use) differ among these groups or not.
There are more than 2 groups being compared, with only one outcome
variable. Thus, an ANOVA is suitable for our case.
We perform an ANOVA with the following code:
res_aov <- aov(`Energy use (kg of oil equivalent per capita)` ~ continent, data = gapminder_clean)
summary(res_aov)
Df Sum Sq Mean Sq F value Pr(>F)
continent 4 7.715e+08 192870621 51.46 <2e-16 ***
Residuals 843 3.160e+09 3748033
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
1759 observations deleted due to missingness
The ANOVA gave a p-value of 8.5270035^{-39}, which sucessfully negated the null hypothesis that there is no difference among group means.
To find out exactly which groups are different from others, it’s useful to conduct a post hoc test. A Tukey HSD test is commonly used to compare all groups to each other (i.e. all possible comparisons of 2 groups), sparing the need to set a reference group.
res_tukey <- TukeyHSD(res_aov, conf.level = 0.95)
res_tukey
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = `Energy use (kg of oil equivalent per capita)` ~ continent, data = gapminder_clean)
$continent
diff lwr upr p adj
Americas-Africa 1005.1037 466.8326 1543.3748 0.0000041
Asia-Africa 1168.7636 628.2529 1709.2742 0.0000000
Europe-Africa 2447.5453 1947.3838 2947.7067 0.0000000
Oceania-Africa 3281.7976 2040.3410 4523.2543 0.0000000
Asia-Americas 163.6599 -384.4160 711.7357 0.9256447
Europe-Americas 1442.4416 934.1141 1950.7691 0.0000000
Oceania-Americas 2276.6940 1031.9249 3521.4630 0.0000069
Europe-Asia 1278.7817 768.0833 1789.4801 0.0000000
Oceania-Asia 2113.0341 867.2950 3358.7732 0.0000402
Oceania-Europe 834.2524 -394.5176 2063.0223 0.3421942
par(mar = c(5.1, 8.1, 4.1, 0.1)) # set margins of plot
plot(res_tukey,
las = 1) # labels horizontal to the axes
The plot above depicts the 95% confidence interval (95% CI) of the
differences in mean levels of continents. According to the plot, the 95%
CI of Asia-Americas and Oceania-Europe
overlapped with 0, meaning that we couldn’t effectively tell that these
groups have differences between the former and the latter country. All
other groups had significant differences.
These are the statistical values from the Tukey HSD test:
data_frame(Group = row.names(res_tukey$continent),
as_data_frame(res_tukey$continent)) %>%
arrange(desc(`p adj`)) %>%
mutate(across(where(is.numeric), round, 4)) %>%
datatable()
Findings: There is significant difference between the group means of
'Energy use (kg of oil equivalent per capita)'across thecontinents. However, a pairwise post-hoc test reveals that it’s not the case betweenAsia-AmericasandOceania-Europe.
Instructions: Is there a significant difference between Europe and Asia with respect to
'Imports of goods and services (% of GDP)'in the years after 1990? (stats test needed)
We filtered the dataset according to the instructions and drew a boxplot:
filtered.gapminder_clean <- gapminder_clean %>%
filter(continent %in% c("Europe", "Asia"), Year >= 1990)
plot <- filtered.gapminder_clean %>%
ggplot(aes(x = continent, y = `Imports of goods and services (% of GDP)`, fill = continent)) +
geom_boxplot(na.rm = TRUE)
ggplotly(plot, height = 350, width = 600)
As can be seen in the boxplot and the following table,
Asia has a slightly higher mean of
Imports of goods and services compared to
Europe.
filtered.gapminder_clean %>% group_by(continent) %>%
summarize(mean = mean(`Imports of goods and services (% of GDP)`, na.rm = TRUE),
IQR = IQR(`Imports of goods and services (% of GDP)`, na.rm = TRUE)) %>%
mutate(across(where(is.numeric), round, 4)) %>%
datatable()
Since we are comparing two groups of quantitative data, a
two-sample t-test could be useful. This test holds the null
hypothesis that the mean of Imports of goods and services
does not differ between the two groups.
res_t <- t.test(formula = `Imports of goods and services (% of GDP)` ~ continent,
data = filtered.gapminder_clean)
res_t
Welch Two Sample t-test
data: Imports of goods and services (% of GDP) by continent
t = 1.3552, df = 137.53, p-value = 0.1776
alternative hypothesis: true difference in means between group Asia and group Europe is not equal to 0
95 percent confidence interval:
-2.321099 12.433240
sample estimates:
mean in group Asia mean in group Europe
46.84531 41.78924
Findings: The two-sample t-test gives a p-value of 0.1775691, which is >0.05, meaning that the mean of
Imports of goods and servicesis not statistically different betweenEuropeandAsia.
Instructions: What is the country (or countries) that has the highest
'Population density (people per sq. km of land area)'across all years? (i.e., which country has the highest average ranking in this category across each time point in the dataset?)
To answer this question, the mean of population density over the years is calculated for each country. In the following table, the countries are ordered by the calculated mean (rounded to the nearsert integer), in descending order.
countries_of_highest_density <- gapminder_clean %>%
dplyr::select(Year, `Country Name`, `Population density (people per sq. km of land area)`) %>%
spread(key = Year, value = `Population density (people per sq. km of land area)`) %>%
bind_cols(., Mean = rowMeans(dplyr::select(., -`Country Name`))) %>%
arrange(desc(Mean)) %>%
relocate(Mean, .after = `Country Name`)
countries_of_highest_density %>% mutate(across(where(is.numeric), round, 0)) %>% datatable(options = list(scrollX = TRUE))
From the table above, Macao SAR, China, Monaco, Hong Kong SAR, China, Singapore, Gibraltar, Bermuda have the top six highest average population density. We could further draw a line plot to observe the changes of population density in these countries over years.
plot <- gapminder_clean %>%
filter(`Country Name` %in% head(countries_of_highest_density)$`Country Name`) %>%
ggplot(aes(x = Year, y = `Population density (people per sq. km of land area)`, color = `Country Name`)) +
geom_line(na.rm = TRUE) +
geom_point(na.rm = TRUE) +
theme(legend.position = "none")
ggplotly(plot, height = 350, width = 600)
Findings: Macao SAR, China, Monaco are the top 2 in average
Population density (people per sq. km of land area)across all years.
Instructions: What country (or countries) has shown the greatest increase in ‘Life expectancy at birth, total (years)’ between 1962 and 2007?
To answer the question, we first transform the table into
Country name x Year, and subtract the whole
table by value of Life expectancy at birth, total (years)
in 1962.
selected_col.gapminder_clean <- gapminder_clean %>%
dplyr::select(Year, `Country Name`, `Life expectancy at birth, total (years)`) %>%
spread(key = `Year`, value = `Life expectancy at birth, total (years)`)
selected_col.gapminder_clean <- data_frame(
`Country Name` = selected_col.gapminder_clean$`Country Name`,
dplyr::select(selected_col.gapminder_clean, -`Country Name`) - selected_col.gapminder_clean$`1962` # Subtract the whole table by value in 1962
) %>%
gather(-`Country Name`, key = Year, value = `Life expectancy at birth, total (years)`)
The table below shows the difference of
Life expectancy at birth, total (years) between 2007 and
1962, arranged in descending order, of each country:
countries_of_greatest_life_expectancy_increase <- selected_col.gapminder_clean %>%
dplyr::select(Year, `Country Name`, `Life expectancy at birth, total (years)`) %>%
spread(key = `Year`, value = `Life expectancy at birth, total (years)`) %>%
dplyr::select(`Country Name`, `Difference between 2007 and 1962` = `2007`) %>%
mutate(across(where(is.numeric), round, 4)) %>%
arrange(desc(`Difference between 2007 and 1962`))
countries_of_greatest_life_expectancy_increase %>% datatable()
The line plot shows the top six countries regarding life expectancy growth. Year 1962 of each country is set to a baseline 0 for better comparison.
plot <- selected_col.gapminder_clean %>%
subset(`Country Name` %in% head(countries_of_greatest_life_expectancy_increase)$`Country Name`) %>%
ggplot(aes(x = Year, y = `Life expectancy at birth, total (years)`, color = `Country Name`)) +
geom_line(aes(group = `Country Name`), na.rm = TRUE) +
geom_point(na.rm = TRUE) +
labs(y = "Difference of life expectancy at birth, total (years)") +
theme(legend.position = "none")
ggplotly(plot, height = 350, width = 600)
Findings: Maldives has shown the greatest increase in
Life expectancy at birth, total (years)between 1962 and 2007.
sessionInfo()
R version 4.2.0 (2022-04-22)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS 13.2.1
Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] DT_0.27 multcomp_1.4-23 TH.data_1.1-1 MASS_7.3-58.3
[5] survival_3.5-5 mvtnorm_1.1-3 ggpubr_0.6.0 plotly_4.10.1
[9] lubridate_1.9.2 forcats_1.0.0 stringr_1.5.0 purrr_1.0.1
[13] tibble_3.2.0 ggplot2_3.4.1 tidyverse_2.0.0 readr_2.1.4
[17] tidyr_1.3.0 stringi_1.7.12 kableExtra_1.3.4 dplyr_1.1.0
[21] shinyjs_2.1.0 shiny_1.7.4
loaded via a namespace (and not attached):
[1] nlme_3.1-162 fontawesome_0.5.0 bit64_4.0.5 webshot_0.5.4
[5] httr_1.4.5 R.cache_0.16.0 tools_4.2.0 backports_1.4.1
[9] bslib_0.4.2 utf8_1.2.3 R6_2.5.1 lazyeval_0.2.2
[13] mgcv_1.8-42 colorspace_2.1-0 withr_2.5.0 tidyselect_1.2.0
[17] bit_4.0.5 compiler_4.2.0 cli_3.6.0 rvest_1.0.3
[21] xml2_1.3.3 sandwich_3.0-2 labeling_0.4.2 sass_0.4.5
[25] scales_1.2.1 systemfonts_1.0.4 digest_0.6.31 rmarkdown_2.20
[29] svglite_2.1.1 R.utils_2.12.2 pkgconfig_2.0.3 htmltools_0.5.4
[33] styler_1.9.1 fastmap_1.1.1 highr_0.10 htmlwidgets_1.6.2
[37] rlang_1.1.0 rstudioapi_0.14 jquerylib_0.1.4 farver_2.1.1
[41] generics_0.1.3 zoo_1.8-11 jsonlite_1.8.4 crosstalk_1.2.0
[45] vroom_1.6.1 car_3.1-1 R.oo_1.25.0 magrittr_2.0.3
[49] Matrix_1.5-3 Rcpp_1.0.10 munsell_0.5.0 fansi_1.0.4
[53] abind_1.4-5 lifecycle_1.0.3 R.methodsS3_1.8.2 yaml_2.3.7
[57] carData_3.0-5 grid_4.2.0 parallel_4.2.0 promises_1.2.0.1
[61] crayon_1.5.2 lattice_0.20-45 splines_4.2.0 hms_1.1.2
[65] knitr_1.42 pillar_1.8.1 ggsignif_0.6.4 codetools_0.2-19
[69] glue_1.6.2 evaluate_0.20 data.table_1.14.8 vctrs_0.6.0
[73] tzdb_0.3.0 httpuv_1.6.9 gtable_0.3.2 cachem_1.0.7
[77] xfun_0.37 mime_0.12 xtable_1.8-4 broom_1.0.4
[81] rstatix_0.7.2 later_1.3.0 rsconnect_0.8.29 viridisLite_0.4.1
[85] memoise_2.0.1 timechange_0.2.0 ellipsis_0.3.2